label budget
Near-Exponential Savings for Mean Estimation with Active Learning
Morimoto, Julian M., Goldin, Jacob, Ho, Daniel E.
We study the problem of efficiently estimating the mean of a $k$-class random variable, $Y$, using a limited number of labels, $N$, in settings where the analyst has access to auxiliary information (i.e., covariates) $X$ that may be informative about $Y$. We propose an active learning algorithm ("PartiBandits") to estimate $\mathbb{E}[Y]$. The algorithm yields an estimate, $\widehat{\mu}_{\text{PB}}$, such that $\left( \widehat{\mu}_{\text{PB}} - \mathbb{E}[Y] \right)^2$ is $\tilde{\mathcal{O}}\left( \frac{\nu + \exp(-c \cdot N/\log(N))}{N} \right)$, where $c > 0$ is a constant and $\nu$ is the risk of the Bayes-optimal classifier. PartiBandits is essentially a two-stage algorithm. In the first stage, it learns a partition of the unlabeled data that shrinks the average conditional variance of $Y$. In the second stage, it uses a UCB-style subroutine ("WarmStart-UCB") to request labels from each stratum round-by-round. Both the main algorithm's and the subroutine's convergence rates are minimax optimal in classical settings. PartiBandits bridges the UCB and disagreement-based approaches to active learning, despite these two approaches being designed to tackle very different tasks. We illustrate our methods through simulation using nationwide electronic health records. Our methods can be implemented using the PartiBandits package in R.
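For intuition, here is a minimal Python sketch of the second-stage idea only: a warm start per stratum, then UCB-style label requests weighted toward high-variance strata. The stratum oracles, warm-start size, and exact bonus term are illustrative assumptions, not the authors' WarmStart-UCB or their R implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def warmstart_ucb(strata, weights, budget, warm=2):
    """UCB-style label requests across strata (illustrative sketch only).

    strata  : list of callables; strata[s]() draws one label from stratum s
    weights : population share of each stratum (sums to 1)
    budget  : total number of labels N
    warm    : labels drawn from every stratum up front (the "warm start")
    """
    S = len(strata)
    counts = np.zeros(S, dtype=int)
    sums = np.zeros(S)
    sumsq = np.zeros(S)

    def pull(s):
        y = strata[s]()
        counts[s] += 1
        sums[s] += y
        sumsq[s] += y * y

    # Warm start: a few labels per stratum so variances are estimable.
    for s in range(S):
        for _ in range(warm):
            pull(s)

    # Spend the remaining budget where the variance-plus-bonus is largest.
    for t in range(budget - warm * S):
        var = np.maximum(sumsq / counts - (sums / counts) ** 2, 0.0)
        bonus = np.sqrt(np.log(t + 2) / counts)
        pull(int(np.argmax(weights * (np.sqrt(var) + bonus))))

    # Stratified estimate of E[Y].
    return float(np.dot(weights, sums / counts))

# Toy example: two strata with different label means.
strata = [lambda: rng.binomial(1, 0.1), lambda: rng.binomial(1, 0.7)]
print(warmstart_ucb(strata, np.array([0.6, 0.4]), budget=200))
```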
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > District of Columbia > Washington (0.04)
Uncertainty-Guided Expert-AI Collaboration for Efficient Soil Horizon Annotation
Chiaburu, Teodor, Singh, Vipin, Haußer, Frank, Bießmann, Felix
Uncertainty quantification is essential in human-machine collaboration, as human agents tend to adjust their decisions based on the confidence of the machine counterpart. Reliably calibrated model uncertainties hence enable more effective collaboration, targeted expert intervention, and more responsible usage of Machine Learning (ML) systems. Conformal prediction has become a well-established model-agnostic framework for uncertainty calibration of ML models, offering statistically valid confidence estimates for both regression and classification tasks. In this work, we apply conformal prediction to $\textit{SoilNet}$, a multimodal multitask model for describing soil profiles. We design a simulated human-in-the-loop (HIL) annotation pipeline, where a limited budget for obtaining ground truth annotations from domain experts is available when model uncertainty is high. Our experiments show that conformalizing SoilNet leads to more efficient annotation in regression tasks and comparable performance scores in classification tasks under the same annotation budget when tested against its non-conformal counterpart. All code and experiments can be found in our repository: https://github.com/calgo-lab/BGR
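A minimal Python sketch of the budgeted routing idea, using locally adaptive split conformal prediction on a toy regression task: the widest prediction intervals go to the expert until the budget is exhausted. The random-forest model, tree-spread difficulty estimate, and budget below are stand-ins, not SoilNet's actual multimodal pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data standing in for one soil-property regression target.
X = rng.normal(size=(600, 5))
y = X[:, 0] + 0.5 * np.sin(3 * X[:, 1]) + rng.normal(scale=0.3, size=600)
X_tr, y_tr = X[:300], y[:300]
X_cal, y_cal = X[300:450], y[300:450]
X_te, y_te = X[450:], y[450:]

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

def spread(Xs):
    # Disagreement across trees as a per-point difficulty estimate.
    per_tree = np.stack([t.predict(Xs) for t in forest.estimators_])
    return per_tree.std(axis=0) + 1e-6

# Locally adaptive split conformal: scores are normalized residuals.
alpha = 0.1
s_cal = np.abs(y_cal - forest.predict(X_cal)) / spread(X_cal)
n = len(s_cal)
q = np.quantile(s_cal, min(1.0, np.ceil((1 - alpha) * (n + 1)) / n))

# Per-point interval half-widths on the test pool.
half = q * spread(X_te)

# Budgeted HIL: route the widest intervals to the expert; keep the
# model's point prediction everywhere else.
budget = 30
to_expert = np.argsort(half)[::-1][:budget]
y_hat = forest.predict(X_te)
y_hat[to_expert] = y_te[to_expert]          # expert supplies ground truth
print("MAE with HIL routing:", np.abs(y_hat - y_te).mean())
```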
- North America > United States > New York > New York County > New York City (0.14)
- Europe > Germany > Berlin (0.05)
- Oceania > Australia > New South Wales > Sydney (0.04)
Matched-Pair Experimental Design with Active Learning
Li, Weizhi, Dasarathy, Gautam, Berisha, Visar
Matched-pair experimental designs aim to detect treatment effects by pairing participants and comparing within-pair outcome differences. In many situations, the overall effect size across the entire population is small, and the focus naturally shifts to identifying and targeting high treatment-effect regions where the intervention is most effective. This paper proposes a matched-pair experimental design that sequentially and actively enrolls patients in high treatment-effect regions. Importantly, we frame the identification of the target region as a classification problem and propose an active learning framework tailored to matched-pair designs. Our design not only reduces the experimental cost of detecting treatment efficacy, but also ensures that the identified regions enclose all high-treatment-effect regions. Our theoretical analysis of the framework's label complexity and experiments in practical scenarios demonstrate the efficiency and advantages of the approach.
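A toy Python sketch of the core loop: treat "high treatment effect" as a classification target on within-pair differences, and enroll new pairs where region membership is least certain. The outcome model, effect threshold, and margin-based query rule are illustrative assumptions, not the paper's exact framework.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy population: the treatment only helps when covariate x0 > 0.
def enroll_pair(x):
    """Enroll one matched pair at covariate profile x; return the
    within-pair outcome difference (treated minus control)."""
    effect = 1.5 if x[0] > 0 else 0.0
    return (effect + rng.normal()) - rng.normal()

pool = rng.uniform(-1, 1, size=(2000, 2))   # candidate covariate profiles

# Seed with random pairs, then enroll where region membership is uncertain.
idx = rng.choice(len(pool), size=20, replace=False).tolist()
diffs = [enroll_pair(pool[i]) for i in idx]
clf = LogisticRegression()

for _ in range(10):                                # 10 rounds of 10 pairs
    labels = (np.array(diffs) > 0.75).astype(int)  # "high effect" pseudo-label
    clf.fit(pool[idx], labels)
    p = clf.predict_proba(pool)[:, 1]              # P(high-effect region)
    chosen = set(idx)
    new = [i for i in np.argsort(np.abs(p - 0.5)) if i not in chosen][:10]
    idx += new
    diffs += [enroll_pair(pool[i]) for i in new]

labels = (np.array(diffs) > 0.75).astype(int)
clf.fit(pool[idx], labels)                         # final boundary estimate
print("estimated region: x0 >", -clf.intercept_[0] / clf.coef_[0, 0])
```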
- North America > United States > Arizona (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Research Report > Strength High (0.93)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.67)
- Health & Medicine > Public Health (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.47)
Label-Efficient Grasp Joint Prediction with Point-JEPA
Guzelkabaagac, Jed, Petrović, Boris
We study whether 3D self-supervised pretraining with Point-JEPA enables label-efficient grasp joint-angle prediction. Meshes are sampled to point clouds and tokenized; a ShapeNet-pretrained Point-JEPA encoder feeds a K=5 multi-hypothesis head trained with winner-takes-all and evaluated by top-logit selection. On a multi-finger hand dataset with strict object-level splits, Point-JEPA improves top-logit RMSE and Coverage@15 in low-label regimes (e.g., 26% lower RMSE at 25% data) and reaches parity at full supervision, suggesting JEPA-style pretraining is a practical lever for data-efficient grasp learning. Self-supervised learning (SSL) for 3D data has largely progressed along three directions. On point clouds this includes point/voxel masked autoencoding; e.g., Voxel-MAE reconstructs masked voxels for sparse automotive LiDAR and improves downstream tasks with fewer labels [1]-[4].
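The multi-hypothesis head is simple to sketch. Below is a minimal PyTorch illustration of a K=5 head trained with winner-takes-all and read out by top-logit selection; the toy encoder, dimensions, and loss weighting are assumptions standing in for the ShapeNet-pretrained Point-JEPA encoder.

```python
import torch
import torch.nn as nn

K, D, J = 5, 256, 16          # hypotheses, embedding dim, joint angles

class MultiHypothesisHead(nn.Module):
    """Predicts K joint-angle hypotheses plus a logit scoring each one."""
    def __init__(self, dim=D, k=K, joints=J):
        super().__init__()
        self.angles = nn.Linear(dim, k * joints)
        self.logits = nn.Linear(dim, k)
        self.k, self.joints = k, joints

    def forward(self, z):
        b = z.shape[0]
        return self.angles(z).view(b, self.k, self.joints), self.logits(z)

# Toy stand-in for the pretrained encoder: flattened points -> embedding.
encoder = nn.Sequential(nn.Linear(3 * 1024, 512), nn.ReLU(), nn.Linear(512, D))
head = MultiHypothesisHead()

# One training step with winner-takes-all: only the closest hypothesis
# receives the regression loss; the logit head learns to point at it.
points = torch.randn(8, 3 * 1024)     # flattened toy point clouds
target = torch.randn(8, J)            # ground-truth joint angles

z = encoder(points)
hyp, logit = head(z)
err = ((hyp - target[:, None, :]) ** 2).mean(dim=-1)   # (B, K)
winner = err.argmin(dim=1)
wta_loss = err.gather(1, winner[:, None]).mean()
sel_loss = nn.functional.cross_entropy(logit, winner)
(wta_loss + 0.1 * sel_loss).backward()

# Inference: top-logit selection picks one hypothesis per object.
with torch.no_grad():
    hyp, logit = head(encoder(points))
    pred = hyp[torch.arange(8), logit.argmax(dim=1)]   # (B, J)
```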
- North America > United States (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Active, anytime-valid risk controlling prediction sets
Rigorously establishing the safety of black-box machine learning models with respect to critical risk measures is important for providing guarantees about the behavior of the model. Recently, a notion of a risk controlling prediction set (RCPS) has been introduced by Bates et al. Further, we propose a framework for constructing RCPSes for active labeling, i.e., allowing one to use a labeling policy that chooses whether to query the true label for each received data point, while ensuring that the expected proportion of data points whose labels are queried remains below a predetermined label budget. We also describe how to use predictors (e.g., the machine learning model we are providing risk control guarantees for) to further improve the utility of our RCPSes by estimating the expected risk conditioned on the covariates. We characterize the optimal choices of labeling policy and predictor under a fixed label budget, and show a regret result that relates the estimation error of the optimal labeling policy and predictor to the wealth process that underlies our RCPSes. Lastly, we present practical ways of formulating label policies, and we empirically show that our label policies use fewer labels to reach higher utility than naive baseline labeling strategies on both simulations and real data.
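A minimal Python sketch of the active-labeling idea: a query policy whose expected labeling rate matches the budget, with inverse-propensity weighting keeping the risk estimate unbiased. The uncertainty proxy and threshold grid are illustrative choices, and the plain point estimate below stands in for the paper's anytime-valid confidence sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

budget = 0.2          # expected fraction of points whose label is queried
alpha = 0.1           # target risk level for the prediction set

# Toy stream: model score s(x) is higher when the label is more likely 1.
n = 5000
score = rng.uniform(size=n)
label = (rng.uniform(size=n) < score).astype(int)

# Labeling policy: query more often where the model is uncertain, while
# keeping the average query probability at the budget (illustrative rule).
raw = score * (1 - score)            # uncertainty proxy
prob = np.clip(budget * raw / raw.mean(), 0.01, 1.0)
queried = rng.uniform(size=n) < prob

# Inverse-propensity-weighted risk of the set {predict 1 iff s(x) >= t}:
# a missed positive (y=1, s < t) counts as an error.
def ipw_risk(t):
    err = (label == 1) & (score < t)
    return ((err & queried) / prob).mean()

# Pick the largest threshold whose estimated risk stays below alpha.
ts = np.linspace(0, 1, 101)
t_hat = max(t for t in ts if ipw_risk(t) <= alpha)
print(f"threshold {t_hat:.2f}, labels used {queried.mean():.1%}")
```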
Pruning-based Data Selection and Network Fusion for Efficient Deep Learning
Kousar, Humaira, Bhatti, Hasnain Irshad, Moon, Jaekyun
Efficient data selection is essential for improving the training efficiency of deep neural networks and reducing the associated annotation costs. However, traditional methods tend to be computationally expensive, limiting their scalability and real-world applicability. We introduce PruneFuse, a novel method that combines pruning and network fusion to enhance data selection and accelerate network training. In PruneFuse, the original dense network is pruned to generate a smaller surrogate model that efficiently selects the most informative samples from the dataset. Once this iterative data selection process has gathered sufficient samples, the insights learned from the pruned model are seamlessly integrated with the dense model through network fusion, providing an optimized initialization that accelerates training. Extensive experimentation on various datasets demonstrates that PruneFuse significantly reduces computational costs for data selection, achieves better performance than baselines, and accelerates the overall training process.
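A toy PyTorch sketch of the prune-select-fuse loop: magnitude pruning builds the surrogate, margin-based selection scores the unlabeled pool, and the surrogate's weights are copied back as a warm start. The pruning criterion and fusion rule here are simplified assumptions, not PruneFuse's exact procedure.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dense = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

# 1) Prune: keep the highest-magnitude hidden units to form a surrogate.
keep = 16
w1, b1 = dense[0].weight.data, dense[0].bias.data
idx = w1.norm(dim=1).topk(keep).indices
surrogate = nn.Sequential(nn.Linear(20, keep), nn.ReLU(), nn.Linear(keep, 2))
surrogate[0].weight.data = w1[idx].clone()
surrogate[0].bias.data = b1[idx].clone()
surrogate[2].weight.data = dense[2].weight.data[:, idx].clone()
surrogate[2].bias.data = dense[2].bias.data.clone()

# 2) Select: the cheap surrogate scores the unlabeled pool; the most
# uncertain (lowest-margin) samples are sent for annotation.
pool = torch.randn(1000, 20)
with torch.no_grad():
    p = surrogate(pool).softmax(dim=1)
margin = (p[:, 0] - p[:, 1]).abs()
to_label = margin.argsort()[:100]

# ... train the surrogate on the newly labeled samples here ...

# 3) Fuse: copy the surrogate's (now trained) weights back into the
# matching slice of the dense net as a warm-start initialization.
with torch.no_grad():
    dense[0].weight.data[idx] = surrogate[0].weight.data
    dense[0].bias.data[idx] = surrogate[0].bias.data
    dense[2].weight.data[:, idx] = surrogate[2].weight.data
print("selected", len(to_label), "samples; dense net warm-started")
```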
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Active Subsampling for Measurement-Constrained M-Estimation of Individualized Thresholds with High-Dimensional Data
In measurement-constrained problems, despite the availability of large datasets, we may only be able to afford to observe the labels on a small portion of the data. This poses a critical question: which data points are most beneficial to label given a budget constraint? In this paper, we focus on the estimation of the optimal individualized threshold in a measurement-constrained M-estimation framework. Our goal is to estimate a high-dimensional parameter $\theta$ in a linear threshold $\theta^T Z$ for a continuous variable $X$ such that the discrepancy between whether $X$ exceeds the threshold $\theta^T Z$ and a binary outcome $Y$ is minimized. We propose a novel $K$-step active subsampling algorithm to estimate $\theta$, which iteratively samples the most informative observations and solves a regularized M-estimator. The theoretical properties of our estimator demonstrate a phase transition phenomenon with respect to $\beta \geq 1$, the smoothness of the conditional density of $X$ given $Y$ and $Z$. For $\beta > (1+\sqrt{3})/2$, we show that the two-step algorithm yields an estimator with the parametric convergence rate $O_p((s \log d / N)^{1/2})$ in $l_2$ norm. This rate is strictly faster than the minimax optimal rate with $N$ i.i.d. samples drawn from the population. For the other two scenarios, $1 < \beta \leq (1+\sqrt{3})/2$ and $\beta = 1$, the estimator from the two-step algorithm is sub-optimal: the former requires running $K > 2$ steps to attain the same parametric rate, whereas in the latter case only a near-parametric rate can be obtained. Furthermore, we formulate a minimax framework for the measurement-constrained M-estimation problem and prove that our estimator is minimax rate optimal up to a logarithmic factor. Finally, we demonstrate the performance of our method in simulation studies and apply the method to analyze a large diabetes dataset.
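A rough Python sketch of the $K$-step idea on synthetic data: fit a sparse surrogate, then spend each new batch of labels on points nearest the current estimated threshold, where observations are most informative. The $l_1$-penalized logistic surrogate below is an illustrative stand-in for the paper's regularized M-estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: Y indicates whether X exceeds the individualized threshold
# theta'Z, observed with label noise; labels are expensive, covariates not.
n, d = 20000, 50
theta = np.zeros(d)
theta[:3] = [1.0, -0.5, 0.8]                            # sparse truth
Z = rng.normal(size=(n, d))
X = rng.normal(size=n)
Y = (X > Z @ theta).astype(int)
flip = rng.uniform(size=n) < 0.1
Y[flip] = 1 - Y[flip]

# Smoothed surrogate: fit a sparse logistic model of Y on (X, Z); since
# P(Y=1) rises in X - theta'Z, theta is recovered as -coef_Z / coef_X.
def fit(idx):
    feats = np.column_stack([X[idx], Z[idx]])
    m = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
    m.fit(feats, Y[idx])
    c = m.coef_[0]
    return -c[1:] / c[0]

# K-step active subsampling: refit, then label points closest to the
# current estimated threshold theta_hat'Z.
N, K = 1500, 3
idx = rng.choice(n, size=N // K, replace=False)
theta_hat = fit(idx)
for _ in range(K - 1):
    dist = np.abs(X - Z @ theta_hat)
    dist[idx] = np.inf                                  # already labeled
    idx = np.concatenate([idx, np.argsort(dist)[: N // K]])
    theta_hat = fit(idx)

print("l2 error:", np.linalg.norm(theta_hat - theta))
```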
- Research Report > Experimental Study (0.93)
- Workflow (0.93)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.65)
- Health & Medicine > Health Care Technology (0.45)
Neural Active Learning Beyond Bandits
Ban, Yikun, Agarwal, Ishika, Wu, Ziwei, Zhu, Yada, Weldemariam, Kommy, Tong, Hanghang, He, Jingrui
We study both stream-based and pool-based active learning with neural network approximations. A recent line of work proposed bandit-based approaches that transform active learning into a bandit problem, achieving both theoretical and empirical success. However, the performance and computational costs of these methods may be sensitive to the number of classes, denoted as K, due to this transformation. Therefore, this paper seeks to answer the question: "How can we mitigate the adverse impacts of K while retaining the advantages of principled exploration and provable performance guarantees in active learning?" To tackle this challenge, we propose two algorithms based on newly designed exploitation and exploration neural networks for stream-based and pool-based active learning. Subsequently, we provide theoretical performance guarantees for both algorithms in a non-parametric setting, demonstrating a slower error-growth rate with respect to K for the proposed approaches. We use extensive experiments to evaluate the proposed algorithms, which consistently outperform state-of-the-art baselines. Active learning is one of the primary areas of machine learning, studying techniques that learn from a small subset of labeled data while achieving generalization performance comparable to passive learning [19]. There are two main settings of active learning: stream-based and pool-based.
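A loose Python sketch of the two-network idea in the stream-based setting: an exploitation network scores classes, an exploration network estimates its residual, and a small top-two margin on the combined score triggers a label request. The architectures and query rule below are simplified assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, k = 10, 3                         # input dim, number of classes

f1 = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, k))   # exploit
f2 = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, k))   # explore
opt = torch.optim.SGD(list(f1.parameters()) + list(f2.parameters()), lr=0.01)

budget, spent = 200, 0
for t in range(2000):                # toy data stream
    x = torch.randn(1, d)
    out1 = f1(x)                     # exploitation scores per class
    out2 = f2(out1.detach())         # exploration net estimates the residual
    score = out1 + out2              # combined optimistic score
    top2 = score.topk(2, dim=1).values[0]
    if (top2[0] - top2[1]) < 0.5 and spent < budget:   # small margin: query
        spent += 1
        y = x[:, :k].argmax(dim=1)   # toy ground-truth labeling rule
        loss1 = nn.functional.cross_entropy(out1, y)
        resid = nn.functional.one_hot(y, k).float() - out1.detach().softmax(1)
        loss2 = ((out2 - resid) ** 2).mean()   # f2 learns f1's residual
        opt.zero_grad()
        (loss1 + loss2).backward()
        opt.step()
print("labels spent:", spent)
```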
- Asia > Middle East > Jordan (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
Label Budget Allocation in Multi-Task Learning
Sun, Ximeng, Sohn, Kihyuk, Saenko, Kate, Mellina, Clayton, Bian, Xiao
The cost of labeling data often limits the performance of machine learning systems. In multi-task learning, related tasks provide information to each other and improve overall performance, but the label cost can vary among tasks. How should the label budget (i.e., the amount of money spent on labeling) be allocated among different tasks to achieve optimal multi-task performance? We are the first to propose and formally define the label budget allocation problem in multi-task learning and to empirically show that different budget allocation strategies make a big difference to multi-task performance. We propose a Task-Adaptive Budget Allocation algorithm to robustly generate the optimal budget allocation adaptive to different multi-task learning settings. Specifically, we estimate and then maximize the extent of new information obtained from the allocated budget as a proxy for multi-task learning performance. Experiments on PASCAL VOC and Taskonomy demonstrate the efficacy of our approach over other widely used heuristic labeling strategies.
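For intuition, a small Python sketch of budget allocation as greedy maximization of marginal information per unit cost. The diminishing-returns curves below are fabricated purely for illustration; the paper's Task-Adaptive Budget Allocation instead estimates the new-information proxy from the model itself.

```python
import numpy as np

# Per-task label costs and toy diminishing-returns "information" curves:
# info_i(n) = a_i * (1 - exp(-n / b_i)).
cost = np.array([1.0, 3.0, 2.0])          # price of one label per task
a = np.array([5.0, 9.0, 4.0])             # max information per task
b = np.array([40.0, 120.0, 60.0])         # how fast returns diminish

def marginal_gain(n, i):
    """Extra information from one more label for task i, per unit cost."""
    gain = a[i] * (np.exp(-n[i] / b[i]) - np.exp(-(n[i] + 1) / b[i]))
    return gain / cost[i]

budget = 300.0
n = np.zeros(3, dtype=int)                # labels bought per task
while True:
    gains = [marginal_gain(n, i) if cost[i] <= budget else -1.0
             for i in range(3)]
    i = int(np.argmax(gains))
    if gains[i] <= 0:                     # no affordable task remains
        break
    n[i] += 1
    budget -= cost[i]

print("labels per task:", n, "| money left:", budget)
```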